Data analysis of Airbnb listings in Boston using the CRISP-DM process.
Airbnb is a popular choice for stays across the world, as it offers a personalized way to travel. I have used Boston, one of the most popular Airbnb cities, for my analysis. I have analyzed the listings offered, the price trends, and the amenities provided. The dataset is available here. For more ideas and visualizations of all Boston datasets, you can look here.
Boston is one of the most visited cities in the world. I will discuss some questions that those planning to travel to Boston with Airbnb would find interesting.
I have used CRISP-DM (CRoss Industry Standard Process for Data Mining) in this process. It is a process model with six phases that naturally describes the data science life cycle. The phases are:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
I have tried to answer the following questions:
# data from http://insideairbnb.com/get-the-data.html
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import folium
import json
import numpy as np
sns.set(rc={'figure.figsize':(11, 4)})
calendar = pd.read_csv('data/calendar.csv', index_col='date', parse_dates=True)
listings = pd.read_csv('./data/listings.csv', parse_dates=True)
reviews = pd.read_csv('./data/reviews.csv', parse_dates=True)
print(f"There are {calendar.shape[0]} entries in Calendar and {calendar.shape[1]} columns")
# the calendar dataset
calendar.head()
calendar.available.value_counts()
calendar[~calendar.price.isnull()].head()
The calendar dataset contains date, listing_id, available and price. available tells us whether the listing is open on that date, and price is the nightly price in $.
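Since `available` holds the strings 't'/'f', the share of open nights can be computed directly; a minimal sketch on toy data (the values below are hypothetical):

```python
import pandas as pd

# toy calendar mimicking the columns described above (hypothetical values)
calendar = pd.DataFrame({
    'listing_id': [1, 1, 2, 2],
    'available': ['t', 'f', 't', 't'],
    'price': ['$100.00', None, '$250.00', '$250.00'],
})

# share of nights marked available; note price is only present when available
open_share = calendar['available'].eq('t').mean()
print(open_share)  # 0.75
```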
print(f"There are {listings.shape[0]} listings in Boston")
# listing dataset
listings.head()
listings[['amenities', 'price']]
In the listings amenities column some values are quoted and some are not, so we have to clean that part; also, the price has a '$' prefix.
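Both cleanups can be sketched on sample strings first; the values below are hypothetical but mirror the formats described above:

```python
import pandas as pd

# sample rows mimicking the raw columns (hypothetical values)
df = pd.DataFrame({
    'price': ['$1,250.00', '$85.00'],
    'amenities': ['{TV,"Wireless Internet",Kitchen}', '{Heating}'],
})

# strip the leading '$' and the thousands separator, then cast to float
df['price'] = df['price'].str[1:].str.replace(',', '', regex=False).astype(float)

# strip the braces, split on commas, and remove the surrounding quotes
df['amenities'] = df['amenities'].map(
    lambda x: [e.strip('"') for e in x.strip('{}').split(',')])

print(df['price'].tolist())      # [1250.0, 85.0]
print(df['amenities'].iloc[0])   # ['TV', 'Wireless Internet', 'Kitchen']
```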
# review dataset
reviews.head()
# formatting and type-casting the price
listings['price'] = listings['price'].str[1:].str.split(',').str.join('').astype(float)
# parsing the amenities string into a list
listings['amenities'] = listings['amenities'].map( lambda x : [elem.strip('""') for elem in x.strip('{}').split(',')])
# adding required columns to calendar
def add_calendar_features(calendar):
'''
INPUT:
calendar - the pandas dataframe containing
calendar listings information
OUTPUT:
df - the pandas dataframe with newly added columns
Year, Month, month_name, Weekday Name, Weekday No,
weekofyear, price formatted
'''
calendar['Year'] = calendar.index.year
calendar['Month'] = calendar.index.month
calendar['month_name'] = calendar.index.month_name()
calendar['Weekday Name'] = calendar.index.day_name()  # .weekday_name was removed in pandas 1.0
calendar['Weekday No'] = calendar.index.weekday
calendar['weekofyear'] = calendar.index.isocalendar().week  # .weekofyear is deprecated
calendar['price'] = calendar['price'].str[1:].str.split(',').str.join('').astype(float)
df = calendar[calendar['available']=='t']
return df
calendar = add_calendar_features(calendar)
calendar.head()
We will skip section 4, Data Modeling, as our questions don't require it.
For each question: analyse, visualise, and give a brief explanation of the visualisation.
df = listings.groupby(['neighbourhood_cleansed']).id.count().reset_index()
df.columns = ['neighbourhood', 'listings_count']
df.sort_values('listings_count', ascending=False)
listings.groupby('room_type').id.count()
## here we are plotting a choropleth map of the listing count per neighborhood
## and then adding each listing's latitude and longitude to a folium map
## 1. create a folium map instance
m = folium.Map(location=[42.33, -71.07], zoom_start=11.5)
## 2. crunch listing to get neighborhood listing count
df = listings.groupby(['neighbourhood_cleansed']).id.count().reset_index()
df.columns = ['neighbourhood', 'listings_count']
## 3. open the geojson file for neighborhood data
with open('data/Boston_Neighborhoods.geojson') as boston_json:
b_json = json.load(boston_json)
## 4. plot the choropleth and add it to the map
folium.Choropleth(geo_data=b_json,
name='choropleth',
data=df,
columns=['neighbourhood','listings_count'],
fill_color='YlOrRd',
key_on ='feature.properties.Name',
fill_opacity=0.6,
line_opacity=0.2,
legend_name = 'listings count'
).add_to(m)
## 5. encode each room type with color
colors = {
"Entire home/apt" : "blue",
"Private room" : "green",
"Shared room" : "red"
}
listings['room_type_color'] = listings['room_type'].apply(lambda x: colors[x])
## add each listing to the map at its location
listings.apply(lambda row:folium.CircleMarker(location=[row["latitude"], row["longitude"]],
radius=1, color=row['room_type_color'], popup=row['neighbourhood'])
.add_to(m), axis=1)
# display map
m
## plotting the count of each property type
data = listings.groupby(['property_type']).id.count().reset_index()
sns.set(rc={'figure.figsize':(15,8.27)})
data.columns = ['property type', 'number of properties']
data = data.sort_values('number of properties', ascending=False)
g = sns.barplot(data=data, x='property type', y='number of properties')
plt.xticks(rotation=90)
plt.show()
We can clearly see a concentration of listings in Back Bay and South End. Each neighborhood's color encodes its listings count, which shows that Jamaica Plain (343), South End (326), and Back Bay (302) are the top three by listings count.
Entire home/apt is the most prominent offering (2127), followed by Private room (1378) and lastly Shared room (80). By property type, Apartment (over 2500 listings) comes first, then House (over 500), then Condominium, Townhouse, Bed & Breakfast, and so on.
Review scores, host response time, and availability are columns in the listings dataframe.
First, looking at listing availability:
# groupby neighborhood to get all mean availability days in each category
df = listings.groupby(['neighbourhood_cleansed']).agg({'availability_30':'mean', 'availability_60':'mean', 'availability_90':'mean', 'availability_365':'mean'}).sort_values('availability_365').reset_index()
sns.set(rc={'figure.figsize':(15,8.27)})
# melt the dataframe
data = pd.melt(df, ['neighbourhood_cleansed'])
data.columns = ['neighborhood', 'label', 'number of times']
# doing a line plot
g = sns.lineplot(data=data, x='neighborhood', y='number of times', hue='label', style='label')
plt.xticks(rotation=90)
plt.show()
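The melt step above turns the wide availability columns into the long form seaborn expects; a minimal sketch with hypothetical numbers:

```python
import pandas as pd

# wide table: one row per neighborhood, one column per availability window
df = pd.DataFrame({
    'neighbourhood_cleansed': ['Back Bay', 'South End'],
    'availability_30': [10.0, 12.0],
    'availability_365': [200.0, 310.0],
})

# melt keeps the id column and stacks the rest as (variable, value) pairs
long = pd.melt(df, ['neighbourhood_cleansed'])
long.columns = ['neighborhood', 'label', 'number of times']
print(long)
```

Each original column becomes a level of the `label` column, which is what lets `sns.lineplot` draw one line per availability window via `hue='label'`.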
# groupby neighborhood to get all mean rating score in each category
df = listings.groupby('neighbourhood_cleansed').agg({'review_scores_cleanliness':'mean',
'review_scores_checkin':'mean', 'review_scores_communication':'mean',
'review_scores_rating':'mean'})\
.sort_values('review_scores_checkin').drop('review_scores_rating', axis=1).reset_index()
sns.set(rc={'figure.figsize':(15,8.27)})
data = pd.melt(df, ['neighbourhood_cleansed'])
data.columns = ['neighborhood', 'label', 'avg Rating']
# plotting
g = sns.lineplot(data=data, x='neighborhood', y='avg Rating', hue='label', style='label')
plt.xticks(rotation=90)
plt.show()
# groupby neighborhood and host response time to get all count of listings in each category
df = listings.groupby(['neighbourhood_cleansed', 'host_response_time']).agg({'id':'count'}).sort_values(['host_response_time','id']).reset_index()
sns.set(rc={'figure.figsize':(15,8.27)})
df.columns = ['neighborhood', 'host_response_time', 'count']
# getting the listing count in each neighborhood
df2= listings.groupby('neighbourhood_cleansed').id.count().reset_index()
df2.columns = ['neighborhood', 'listings count']
# joining the two dataframes
df = pd.merge(df, df2, on='neighborhood')
# calculating the response count per listing for the plot
df['response per listings'] = df['count'] / df['listings count']
df = pd.pivot_table(df, values='response per listings', index=['neighborhood'], columns=['host_response_time']).sort_values('within an hour')
df = pd.melt(df.reset_index(), id_vars=['neighborhood'], value_vars=['a few days or more', 'within a day', 'within a few hours', 'within an hour'])
df.columns = ['neighborhood', 'host_response_time', 'response per listings']
# plotting
g = sns.lineplot(data=df, x='neighborhood', y='response per listings', hue='host_response_time', style='host_response_time')
plt.xticks(rotation=90)
plt.show()
df
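The merge/divide/pivot sequence above normalizes each neighborhood's response-time counts by its listing total; a toy sketch with hypothetical values:

```python
import pandas as pd

# counts of listings per (neighborhood, response time) -- hypothetical
df = pd.DataFrame({
    'neighborhood': ['A', 'A', 'B'],
    'host_response_time': ['within an hour', 'within a day', 'within an hour'],
    'count': [8, 2, 5],
})
totals = pd.DataFrame({'neighborhood': ['A', 'B'], 'listings count': [10, 5]})

# join in the totals and normalize each count by the neighborhood's listings
merged = pd.merge(df, totals, on='neighborhood')
merged['response per listings'] = merged['count'] / merged['listings count']

# pivot to one row per neighborhood, one column per response time
wide = pd.pivot_table(merged, values='response per listings',
                      index='neighborhood', columns='host_response_time')
print(wide)
```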
Price is given in both the listings and calendar dataframes, but here I use the price data from calendar to calculate the mean price in each neighborhood for each month. To get the weekly trend I do the same thing week-wise.
# join listing data and calendar data for available days
df = pd.merge(listings[['id', 'neighbourhood_cleansed']], calendar[['listing_id', 'price', 'month_name']], left_on='id', right_on='listing_id')
# group the data by each neighborhood and month and calculate price mean
df = df.groupby(['neighbourhood_cleansed', 'month_name']).agg({'price':'mean'}).sort_values('price').reset_index()
sns.set(rc={'figure.figsize':(15,8.27)})
df.columns = ['neighborhood', 'month', 'price ($)']
# plot the data
g = sns.lineplot(data=df, x='neighborhood', y='price ($)', hue='month', style="month")
plt.xticks(rotation=90)
plt.show()
# join listing data and calendar data for available days
df = pd.merge(listings[['id', 'neighbourhood_cleansed']], calendar[['listing_id', 'price', 'Weekday Name']], left_on='id', right_on='listing_id')
# group the data by each neighborhood and week and calculate price mean
df = df.groupby(['neighbourhood_cleansed', 'Weekday Name']).agg({'price':'mean'}).sort_values('price').reset_index()
sns.set(rc={'figure.figsize':(15,8.27)})
df.columns = ['neighborhood', 'Weekday Name', 'price ($)']
g = sns.lineplot(data=df, x='neighborhood', y='price ($)', hue='Weekday Name', style="Weekday Name")
plt.xticks(rotation=90)
plt.show()
# calendar already contains only the available listings (filtered in add_calendar_features)
available = calendar
sns.set_theme(style="whitegrid")
# plot the bar for each week using seaborn
ax = sns.barplot(x="Weekday Name", y="price", data=available.sort_values("Weekday No"))
ax.set_ylim(190,205)
calendar.groupby('Weekday Name').price.mean().reset_index().sort_values('price', ascending=False)
There are three types of rooms offered in Boston: Entire home/apt, Private room, and Shared room.
# merge listings and calendar data for available days
df = pd.merge(listings[['id', 'neighbourhood_cleansed', 'room_type']], calendar[['listing_id', 'price']], left_on='id', right_on='listing_id')
# group by neighborhood and room type and calculate mean of each group
df = df.groupby(['neighbourhood_cleansed', 'room_type']).agg({'price':'mean'}).sort_values(['room_type','price']).reset_index()
sns.set(rc={'figure.figsize':(15,8.27)})
df.columns = ['neighborhood', 'room_type', 'price ($)']
# plot
g = sns.lineplot(data=df, x='neighborhood', y='price ($)', hue='room_type', style='room_type')
plt.xticks(rotation=90)
plt.show()
Count the occurrences of each amenity per room type, then compute the fraction by dividing by that room type's listing count.
# explode amenities
df = listings[['room_type', 'amenities']].explode('amenities')
# set occurrence/count as 1
df['count'] = 1
df = pd.pivot_table(df, values='count', index=['amenities'], columns=['room_type'], aggfunc={'count':np.sum}).sort_values('Entire home/apt', ascending=False)
# calculating the fraction of listings offering each amenity
df = (df/listings.groupby('room_type').id.count()).sort_values(['Entire home/apt', 'Private room', 'Shared room'], ascending=False)
df = df[df.index != '']
print(f"Amenities count {df.shape[0]}")
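The `df / listings.groupby('room_type').id.count()` step works because pandas aligns the Series index (room types) with the DataFrame columns; a toy sketch with hypothetical counts:

```python
import pandas as pd

# amenity occurrences per room type (columns) -- hypothetical
counts = pd.DataFrame(
    {'Entire home/apt': [90, 40], 'Private room': [30, 10]},
    index=['Wifi', 'Kitchen'])

# total listings per room type
totals = pd.Series({'Entire home/apt': 100, 'Private room': 50})

# DataFrame / Series divides column-wise: the Series index aligns with columns
frac = counts / totals
print(frac)
```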
Printing the 10 most common amenities in Boston
df.head(10)
Printing the 10 least common amenities in Boston
df.tail(10)
Trying to see the difference between the two groups using a box plot
normal_amenities = set(df.head(10).index.values)
listings['has_all_common_amenities'] = listings.amenities.apply(lambda x: len(normal_amenities - set(x)) == 0)
sns.set(rc={'figure.figsize':(10,7)})
sns.boxplot(x="has_all_common_amenities", y="price", data=listings)
Printing the mean price of listings with and without all the common amenities
listings.groupby('has_all_common_amenities').price.mean().reset_index()
listings has a host_since column, the date on which that host first became active. If we take the difference between host_since and the most recent host_since date, we get each host's tenure in days. Using this we can filter the listings added in the last, say, 90 days and compute the count of new listings and the percentage change in each neighborhood.
# using 90 days
days_old = 90
listings['host_since'] = pd.to_datetime(listings['host_since'])
# calculating the most recent listing date
max_host = listings['host_since'].max()
# calculating host since in days
listings['host_since(days)'] = (max_host - listings['host_since']).dt.days
# getting the count of all new listings in each neighborhood
df = listings[listings['host_since(days)']<days_old].groupby(['neighbourhood_cleansed']).id.count().reset_index()
sns.set(rc={'figure.figsize':(15,8.27)})
df.columns = ['neighbourhood', 'new_listings_count']
df2 = listings[listings['host_since(days)']>=days_old].groupby(['neighbourhood_cleansed']).id.count().reset_index()
df2.columns = ['neighbourhood', 'old_listings_count']
df = pd.merge(df, df2, on="neighbourhood")
# calculating the ratio of new to old listings (labelled 'percentage')
df['percentage'] = df['new_listings_count']/df['old_listings_count']
df = df.sort_values('new_listings_count', ascending=False)
# setting up some variables for plotting
font_color = '#525252'
hfont = {'fontname':'Calibri'}
facecolor = '#eaeaf2'
color_red = '#fd625e'
color_blue = '#01b8aa'
index = df.neighbourhood
column0 = df['percentage']
column1 = df['new_listings_count']
title0 = '% change'
title1 = 'new listings count'
# initializing the plot
fig, axes = plt.subplots(figsize=(10,5), facecolor=facecolor, ncols=2, sharey=True)
fig.tight_layout()
# plotting both count and percentage data
axes[0].barh(index, column0, align='center', color=color_red, zorder=10)
axes[0].set_title(title0, fontsize=18, pad=15, color=color_red, **hfont)
axes[1].barh(index, column1, align='center', color=color_blue, zorder=10)
axes[1].set_title(title1, fontsize=18, pad=15, color=color_blue, **hfont)
# If you have positive numbers and want to invert the x-axis of the left plot
axes[0].invert_xaxis()
# To show data from highest to lowest
plt.gca().invert_yaxis()
for label in (axes[0].get_xticklabels() + axes[0].get_yticklabels()):
label.set(fontsize=13, color=font_color, **hfont)
for label in (axes[1].get_xticklabels() + axes[1].get_yticklabels()):
label.set(fontsize=13, color=font_color, **hfont)
plt.subplots_adjust(wspace=0, top=0.85, bottom=0.1, left=0.18, right=0.95)
plt.show()
df
# recent listings (hosts who joined in the last 90 days)
days_old = 90
listings['host_since'] = pd.to_datetime(listings['host_since'])
max_host = listings['host_since'].max()
listings['host_since(days)'] = (max_host - listings['host_since']).dt.days
df = listings[listings['host_since(days)']<days_old].groupby(['room_type']).id.count().reset_index()
df.columns = ['room_type', 'new_rooms']
# older listings
df2 = listings[listings['host_since(days)']>=days_old].groupby(['room_type']).id.count().reset_index()
df2.columns = ['room_type', 'old_rooms']
df = pd.merge(df, df2, on="room_type")
# ratio of new rooms to old rooms for each room type
df['percentage'] = df['new_rooms']/df['old_rooms']
df = df.sort_values('percentage', ascending=False)
sns.set(rc={'figure.figsize':(15,8.27)})
g = sns.barplot(data=df, x='room_type', y='percentage', palette="Set1")
plt.show()
df